ExecSolib strategy: make ddtrace.so directly executable#3711

Open
cataphract wants to merge 33 commits into master from glopes/exec-solib

@cataphract cataphract commented Mar 19, 2026

Introduce the ExecSolib spawn strategy by embedding an ELF entry point (_dd_solib_start) into ddtrace.so itself.

ddtrace.so becomes a PIE executable that runs without the dynamic linker. After self-relocation, it:

  • loads the trampoline into memory, without executing it yet
  • copies ddtrace.so (/proc/self/exe) into a memfd and massages the copy so that the dynamic linker can load it without PHP
  • replaces every occurrence of ddtrace.so on the command line with /proc/self/fd/, so that dlopen from the trampoline uses the massaged version
  • loads the dynamic linker into memory
  • massages the auxiliary vector so that ld.so executes the loaded trampoline, then jumps into the dynamic linker
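
The auxiliary-vector step can be sketched as follows. This is a hosted C illustration, not the code in this PR: `AuxEntry`, `LoadedImage`, and `patch_auxv` are hypothetical names, and the choice of which tags to rewrite (AT_PHDR, AT_PHENT, AT_PHNUM, AT_ENTRY for the trampoline, AT_BASE for ld.so) is an assumption based on how the kernel normally describes a main program to its interpreter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Auxiliary-vector tags, with the values the Linux ABI assigns in <elf.h>. */
enum { AUX_NULL = 0, AUX_PHDR = 3, AUX_PHENT = 4, AUX_PHNUM = 5,
       AUX_BASE = 7, AUX_ENTRY = 9 };

/* One auxv slot on an LP64 target: a tag followed by a value. */
typedef struct { uint64_t a_type; uint64_t a_val; } AuxEntry;

/* Hypothetical description of an image that was mapped by hand. */
typedef struct {
    uint64_t phdr;   /* address of its program headers */
    uint64_t phent;  /* size of one program header */
    uint64_t phnum;  /* number of program headers */
    uint64_t entry;  /* its entry point */
} LoadedImage;

/* Rewrite auxv in place so that ld.so sees `prog` as the main program and
 * `ldso_base` as its own load address. Returns the number of slots changed. */
static int patch_auxv(AuxEntry *auxv, const LoadedImage *prog, uint64_t ldso_base)
{
    int patched = 0;
    assert(auxv != NULL && prog != NULL);
    for (AuxEntry *e = auxv; e->a_type != AUX_NULL; e++) {
        switch (e->a_type) {
        case AUX_PHDR:  e->a_val = prog->phdr;  patched++; break;
        case AUX_PHENT: e->a_val = prog->phent; patched++; break;
        case AUX_PHNUM: e->a_val = prog->phnum; patched++; break;
        case AUX_ENTRY: e->a_val = prog->entry; patched++; break;
        case AUX_BASE:  e->a_val = ldso_base;   patched++; break;
        default: break;  /* leave every other entry untouched */
        }
    }
    return patched;
}
```

A real bootstrap that runs before libc is usable could not call `assert`; it is used here only to keep the sketch self-checking.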

Tested on glibc and musl on Linux aarch64.

Description

Reviewer checklist

  • Test coverage seems ok.
  • Appropriate labels assigned.

@cataphract cataphract requested review from a team as code owners March 19, 2026 17:44

datadog-datadog-prod-us1 bot commented Mar 20, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 60.65% (+0.01%)

🔗 Commit SHA: eece5c4


codecov-commenter commented Mar 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.78%. Comparing base (10064f5) to head (3442a78).
⚠️ Report is 51 commits behind head on master.

@@           Coverage Diff           @@
##           master    #3711   +/-   ##
=======================================
  Coverage   68.78%   68.78%           
=======================================
  Files         166      166           
  Lines       19015    19015           
  Branches     1792     1792           
=======================================
  Hits        13079    13079           
  Misses       5124     5124           
  Partials      812      812           
Flag Coverage Δ
helper-rust-integration 78.82% <ø> (ø)
helper-rust-unit 49.36% <ø> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Introduce the ExecSolib spawn strategy by embedding an ELF entry point
(_dd_solib_start) into ddtrace.so itself.
But the import can't be declared hidden. In the end the symbol will be
in the GOT, but the linker emits a RELATIVE reloc (not GLOB_DAT), so it
should work with our self-relocation.
cataphract and others added 2 commits March 20, 2026 12:29
The x86-64 inline asm restoring the kernel stack and jumping to ld.so:

    "mov %[sp], %%rsp\n"
    "xor %%edx, %%edx\n"   // required: rdx = 0 for ld.so startup ABI
    "jmpq *%[entry]\n"

GCC at -O0 allocated %[entry] (ldso_entry) to rdx, causing the xor to
zero the jump target before the jmpq executed → SIGSEGV at address 0x0
on every x86-64 ExecSolib launch.

The fix is to pin ldso_entry to rax via the "a" constraint.  Using the
"rdx" clobber alone is not sufficient: GCC is permitted to allocate
input operands into clobbered registers because inputs are consumed
before the asm fires.  A specific register constraint ("a" = rax) is
the correct and optimization-safe solution.

With the fix, GCC emits:
    mov  %rcx, %rsp    ; stack_top in rcx (or any non-rax "r")
    xor  %edx, %edx    ; zero rdx (harmless: entry is in rax)
    jmpq *%rax         ; jump to ldso_entry
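
The effect of the "a" constraint can be checked in isolation with a small sketch. It is not the code from this commit: `survive_rdx_zeroing` is a hypothetical helper, and a `mov` stands in for the `jmpq` so that the function can return. On targets other than x86-64 the asm is skipped.

```c
#include <assert.h>
#include <stdint.h>

/* Zeroes rdx the way the jump sequence does, then reads back the operand
 * that was pinned to rax. Returns the value the asm observed. */
static uint64_t survive_rdx_zeroing(uint64_t target)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    uint64_t seen;
    __asm__ volatile(
        "xor %%edx, %%edx\n\t"        /* rdx = 0, as in the real sequence */
        "mov %[entry], %[seen]\n\t"   /* stand-in for: jmpq *%[entry] */
        : [seen] "=r"(seen)
        : [entry] "a"(target)         /* "a" pins this operand to rax */
        : "rdx", "cc");               /* rdx and the flags are modified */
    assert(seen == target);           /* the pinned operand was not zeroed */
    return seen;
#else
    return target;                    /* no inline asm on other targets */
#endif
}
```

Compiling this at -O0 and at -O2 and inspecting the output is one way to confirm that the operand stays in rax regardless of optimization level.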

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

pr-commenter bot commented Mar 20, 2026

Benchmarks [ tracer ]

Benchmark execution time: 2026-04-10 17:06:18

Comparing candidate commit eece5c4 in PR branch glopes/exec-solib with baseline commit 14dabf8 in branch master.

Found 0 performance improvements and 2 performance regressions! Performance is the same for 190 metrics, 2 unstable metrics.

scenario:MessagePackSerializationBench/benchMessagePackSerialization-opcache

  • 🟥 execution_time [+2.377µs; +4.203µs] or [+2.359%; +4.170%]

scenario:SamplingRuleMatchingBench/benchRegexMatching3

  • 🟥 execution_time [+54.437ns; +143.963ns] or [+3.678%; +9.727%]

@cataphract cataphract force-pushed the glopes/exec-solib branch 2 times, most recently from 76bb66d to 4e0b02e Compare March 21, 2026 12:44
@cataphract cataphract force-pushed the glopes/exec-solib branch 2 times, most recently from 4cffacf to 9293858 Compare March 22, 2026 02:16
@cataphract cataphract force-pushed the glopes/exec-solib branch 2 times, most recently from 6e1977d to 09acb37 Compare March 24, 2026 16:55
cataphract and others added 7 commits March 25, 2026 02:19
…n detection

Update libdatadog submodule to fix container ID extraction when running under
Podman with cgroupns=host. The container cgroup path includes a /container
subdirectory after the .scope suffix (e.g.
0::/machine.slice/libpod-HEXID.scope/container), which the previous regex
did not handle. This caused origin detection to fail: no entity ID was sent
to the agent, so container tags were missing from APM traces.
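
For illustration, matching the path shape described above could look like the sketch below. The pattern and the `podman_container_id` helper are assumptions written for this example; they are not the regex used by libdatadog. The caller is assumed to have stripped the trailing newline from the /proc/self/cgroup line.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Extracts the 64-hex-digit container ID from one cgroup line, accepting an
 * optional "/container" component after ".scope". Returns 1 and fills `out`
 * (at least 65 bytes) on success, 0 otherwise. */
static int podman_container_id(const char *line, char out[65])
{
    static const char pattern[] =
        "libpod-([0-9a-f]{64})\\.scope(/container)?$";
    regex_t re;
    regmatch_t m[2];
    int found = 0;

    assert(line != NULL && out != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 0;
    if (regexec(&re, line, 2, m, 0) == 0) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);  /* always 64 here */
        memcpy(out, line + m[1].rm_so, len);
        out[len] = '\0';
        found = 1;
    }
    regfree(&re);
    return found;
}
```

Making the "/container" suffix optional keeps paths without the extra subdirectory matching as well.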

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@cataphract cataphract requested a review from a team as a code owner April 9, 2026 09:56
cataphract and others added 2 commits April 9, 2026 13:35
The trampoline binary embedded in ddtrace.so was produced as ET_EXEC
(non-PIE) by toolchains that don't default to -fPIE (e.g. devtoolset-7
on CentOS 7).  elf_load_trampoline accepted only ET_DYN and used
mmap(NULL) to pick a random load base — an ET_EXEC binary loaded that
way crashes because its absolute virtual addresses no longer match.

Two-pronged fix:
1. libdatadog/spawn_worker/build.rs: add -fPIE/-pie on Linux so the
   trampoline is always ET_DYN, matching the original design intent.
2. solib_bootstrap.c: add a __builtin_trap() guard after the ET_DYN
   check so a mis-built ET_EXEC trampoline aborts loudly instead of
   silently misbehaving.
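
A minimal sketch of the ET_DYN check, assuming the trampoline image is already in memory and was built for the host's byte order. `trampoline_is_pie` is a hypothetical name; the real guard lives in solib_bootstrap.c and traps instead of returning a flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { SK_ET_EXEC = 2, SK_ET_DYN = 3 };  /* e_type values from the ELF spec */

/* Returns 1 when `image` starts with an ELF header whose e_type is ET_DYN,
 * that is, when the image is position independent and can be mapped at an
 * arbitrary base. Returns 0 for ET_EXEC or anything that is not ELF. */
static int trampoline_is_pie(const unsigned char *image, size_t len)
{
    uint16_t e_type;
    assert(image != NULL);
    if (len < 18 || memcmp(image, "\177ELF", 4) != 0)
        return 0;
    memcpy(&e_type, image + 16, sizeof e_type);  /* e_type follows e_ident[16] */
    return e_type == SK_ET_DYN;
}
```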

Fixes "failed to map trampoline" (exit 121) on bookworm-slim for
PHP 8.3-8.5.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
cataphract and others added 8 commits April 9, 2026 16:57
…ssion

Update libdatadog submodule to include the fix for pecl-installed
extensions: access(X_OK) check before ExecSolib execve, falling back
to FdExec (fexecve via trampoline) when the +x bit is absent.
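
The fallback decision can be sketched in C as below. The actual check lives in libdatadog's Rust code; `SpawnStrategy` and `pick_spawn_strategy` are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <unistd.h>

typedef enum { SPAWN_EXEC_SOLIB, SPAWN_FD_EXEC } SpawnStrategy;

/* ExecSolib needs to execve() the shared object directly, which requires the
 * execute bit. When access(X_OK) fails (for example because the file was
 * installed without +x), fall back to FdExec. */
static SpawnStrategy pick_spawn_strategy(const char *solib_path)
{
    assert(solib_path != NULL);
    return access(solib_path, X_OK) == 0 ? SPAWN_EXEC_SOLIB : SPAWN_FD_EXEC;
}
```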

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…to RUST_FILES

dlopen()/dlsym() crash when called from a shared library that is exec'd as
the main program via ld.so: glibc's __libc_start_main never runs, so the
internal dynamic-linker state is uninitialised.

Replace the dlsym(RTLD_DEFAULT, symbol) lookup with direct extern weak
references to ddog_daemon_entry_point / ddog_crashtracker_entry_point.
Since ssi_entry.c is compiled into libddtrace_php.so, both symbols live in
the same binary and are resolved at link time — no runtime dl machinery
needed.

Also correct the argv-layout comment: ld.so strips its own path before
calling the entry point, so the program sees argv[0]=lib_path (not ld_path).

Separately, add libdd-shared-runtime to the RUST_FILES brace-expansion in
the Makefile so that pecl/loader tests rebuild when that crate changes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…i_entry.c

Three bugs in the ExecSolib SSI sidecar startup path, found by empirical
testing with minimal probe programs on ubuntu-vm:

1. Stack misalignment (root cause of the original crash)
   At process entry the kernel sets rsp % 16 == 0.  x86-64 SysV ABI
   requires rsp % 16 == 8 at C function entry (as if 'call' pushed a
   return address).  Jumping to ssi_main without adjustment meant any
   SSE instruction requiring 16-byte alignment (e.g. 'movaps' inside
   pthread_mutex_lock, which dlsym calls) would SIGSEGV.
   Fix: add 'and $-16, %rsp; sub $8, %rsp' before 'jmp ssi_main'.

2. .init_array not called
   ld.so's _dl_init skips .init_array for the main executable
   (l_name="" && l_type=lt_executable — confirmed in glibc dl-init.c:46-48).
   It expects __libc_start_main to handle it, but we never call that.
   Rust runtime initialisation (allocator, TLS, panic hooks, ...) lives
   in .init_array.  Without it, calling a Rust entry point would crash.
   Fix: run_own_init_array() walks _DYNAMIC to find DT_INIT/DT_INIT_ARRAY
   and calls them before entering Rust.  Load base comes from __ehdr_start
   (linker symbol at VMA 0 of the DSO).

3. dlsym replaced with direct extern weak references
   dlsym would work with a properly aligned stack, but is unnecessary
   since ssi_entry.c is compiled into the same libddtrace_php.so as
   ddog_daemon_entry_point / ddog_crashtracker_entry_point.  Direct
   extern weak references are resolved at link time — simpler and with
   no runtime overhead.
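
The arithmetic behind the fix in item 1 can be checked on plain integers. This sketch only mirrors what `and $-16, %rsp; sub $8, %rsp` does to the stack pointer value; `align_for_c_entry` is a hypothetical helper, not part of ssi_entry.c.

```c
#include <assert.h>
#include <stdint.h>

/* Round sp down to a 16-byte boundary, then subtract 8 so that the value
 * looks as if a `call` had just pushed a return address. */
static uintptr_t align_for_c_entry(uintptr_t sp)
{
    uintptr_t adjusted = (sp & ~(uintptr_t)15) - 8;
    assert(adjusted % 16 == 8);  /* what the x86-64 SysV ABI expects at entry */
    return adjusted;
}
```

The `and` makes the result independent of the incoming alignment, so the same sequence is correct whether or not the kernel-provided value was already 16-byte aligned.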

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both ddog_daemon_entry_point and ddog_crashtracker_entry_point are always
present in libddtrace_php.so; weak declarations would silently produce a
NULL call instead of a link error if they were ever missing.  Use plain
extern declarations so the linker catches absent symbols at build time.

Also remove the now-disproved claim that dlsym cannot work from the entry
point — the actual issue was stack misalignment, not missing glibc init.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…p) signature

Per ELF spec / glibc ldsodefs.h dl_init_t, .init_array constructors receive
(int argc, char **argv, char **envp).  Previously they were called as void(*)(void),
which works in practice on SysV ABIs (extra args in registers, callee ignores them)
but is technically UB.  Pass the real values; envp is derived from the initial
stack layout (argv + argc + 1) before argc/argv are adjusted.
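
A sketch of the calling convention described above, under the assumption that `argv` still points into the initial stack so that the environment block follows its terminating NULL. `InitFn`, `envp_from_argv`, `call_init_array`, and the `record_args` sample constructor are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Signature that .init_array entries are called with. */
typedef void (*InitFn)(int argc, char **argv, char **envp);

/* On the initial stack, envp starts right after argv's terminating NULL. */
static char **envp_from_argv(int argc, char **argv)
{
    assert(argv != NULL && argv[argc] == NULL);
    return argv + argc + 1;
}

/* Calls every non-NULL constructor with the real arguments and returns how
 * many were invoked. */
static size_t call_init_array(const InitFn *fns, size_t count,
                              int argc, char **argv, char **envp)
{
    size_t called = 0;
    for (size_t i = 0; i < count; i++) {
        if (fns[i] != NULL) {
            fns[i](argc, argv, envp);
            called++;
        }
    }
    return called;
}

/* Sample constructor that records what it was given. */
static int g_ctor_calls;
static char **g_last_envp;
static void record_args(int argc, char **argv, char **envp)
{
    (void)argc;
    (void)argv;
    g_ctor_calls++;
    g_last_envp = envp;
}
```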

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Some clang configurations (cc crate passes --target=x86_64-unknown-linux-gnu
which changes header search paths) cannot find system elf.h, causing build
failures in CI (11 compile errors: ElfW undeclared, _DYNAMIC undeclared, etc.).

Replace ElfW(Dyn) / DT_* with a self-contained SsiDyn struct and inline
#define constants.  ssi_entry.c targets Linux LP64 only (x86-64 + aarch64),
so intptr_t/uintptr_t match the 64-bit ELF Dyn layout exactly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
libdd-libunwind-sys/build.rs uses #[path = "buildscript/linux.rs"] to include
platform-specific build logic, but the RUST_FILES find filter in the Makefile
only matched */src*, */build.rs, etc. — not */buildscript*.

The generated pecl .tgz therefore lacked buildscript/{linux,macos,windows}.rs,
causing "couldn't read libdd-libunwind-sys/buildscript/linux.rs: No such file"
when pecl install tried to compile the Rust code.

Add -path "*/buildscript*" to the find filter.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>